Chapter 1


Introduction 

The behavior of many multimedia applications can be characterized by patterns
of stream processing computation and modeled efficiently using dataflow models
of computation. In multimedia and other signal-processing-intensive application
domains, dataflow graph models are widely used to describe applications because
of their natural correspondence to signal flow graphs and because of the
important forms of computational structure that such models expose.

A dataflow graph is a directed graph in which vertices (actors) represent
computational functions and edges represent inter-actor communication channels
that buffer data in a first-in first-out (FIFO) fashion. Dataflow actors can
contain computations of arbitrary complexity as long as the interfaces of the
computations conform to dataflow semantics. That is, actors consume data from
their input edges and produce data on their output edges, and each actor
execution (firing) depends on the availability of sufficient data on the input
edges of the associated actor.
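
The firing rule described above can be illustrated with a minimal Java sketch. All names in it (FiringRuleSketch, FifoEdge, SumActor, canFire, fire, run) are hypothetical and are not part of DIF or any other tool; the sketch also assumes a fixed consumption rate per firing, as in SDF.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Illustrative only: one actor between two FIFO edges.
// Every name here is hypothetical and not taken from DIF or TDIF.
final class FiringRuleSketch {
    /** An edge buffers tokens in first-in first-out order. */
    static final class FifoEdge {
        private final Deque<Integer> tokens = new ArrayDeque<>();
        void put(int token) { tokens.addLast(token); }
        int take() { return tokens.removeFirst(); }
        int size() { return tokens.size(); }
    }

    /** An actor that consumes `rate` tokens per firing and produces their sum. */
    static final class SumActor {
        private final FifoEdge in;
        private final FifoEdge out;
        private final int rate;

        SumActor(FifoEdge in, FifoEdge out, int rate) {
            this.in = in;
            this.out = out;
            this.rate = rate;
        }

        /** Firing rule: sufficient data must be available on the input edge. */
        boolean canFire() { return in.size() >= rate; }

        void fire() {
            if (!canFire()) {
                throw new IllegalStateException("insufficient input data");
            }
            int sum = 0;
            for (int k = 0; k < rate; k++) {
                sum += in.take();
            }
            out.put(sum);
        }
    }

    /** Feeds tokens 1..n to the actor, firing it whenever it is enabled. */
    static List<Integer> run(int n, int rate) {
        FifoEdge in = new FifoEdge();
        FifoEdge out = new FifoEdge();
        SumActor actor = new SumActor(in, out, rate);
        List<Integer> results = new ArrayList<>();
        for (int t = 1; t <= n; t++) {
            in.put(t);
            while (actor.canFire()) {
                actor.fire();
                results.add(out.take());
            }
        }
        return results;
    }
}
```

With five input tokens and a consumption rate of 2, the actor fires twice and the fifth token remains buffered on the input edge.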

When implementing a dataflow-based multimedia application model on a target
platform, scheduling plays an important role (e.g., see [2]). Here, by
scheduling, we refer to the process of determining which processing resource
each actor executes on, and the ordering of execution among actors that share
the same resource. By affecting key metrics that include performance and memory
usage, scheduling often has a significant impact on implementation quality.

For dataflow models of large-scale digital signal processing (DSP) applications,
the underlying graph representations often consist of smaller substructures that
repeat multiple times. Topological patterns have been shown to enable more
concise representation and direct analysis of such substructures in the context
of high-level DSP specification languages and design tools [23]. Furthermore, by
allowing designers to explicitly identify such repeating structures, use of
topological patterns provides an efficient alternative to automated detection of
such patterns, which entails costly searching in terms of graph isomorphism and
related forms of computation. A topological pattern is inherently parameterized
and provides a natural interface for parameterized scheduling, which enables
efficient derivation of adaptive schedule structures that adjust symbolically in
terms of design-time or run-time variations.

1.1  Contributions  of  this  thesis 

In this thesis, we present a formal design method for specifying topological
patterns and deriving parameterized schedules from such patterns based on a
novel schedule model called the scalable schedule tree. Our method ensures
deterministic behavior of the system based on compile-time analysis of system
behavior that may contain structured, parameterizable patterns of actor and edge
instantiations. We have also developed an associated software plug-in and
integrated it into the dataflow interchange format (DIF) framework and the
associated cross-platform design and synthesis environment called targeted DIF
(TDIF) [9, 24]. TDIF is a companion design tool of the DIF framework that
supports dynamic dataflow analysis, cross-platform actor design, and code
generation on targeted platforms [24].

1.2  Outline  of  the  thesis 

The rest of the thesis is organized as follows. Chapter 2 provides a summary of
background on dataflow graph modeling, schedule representation, and design tool
development, as well as related work on parameterized scheduling. Chapter 3
presents the formalization of our proposed schedule model, the scalable schedule
tree. Chapter 4 presents the integration of our design method into the DIF
framework and the associated TDP software package. Chapter 5 demonstrates the
contributions of this thesis using a case study of an image registration
application. Lastly, conclusions and future work are discussed in Chapter 6.




Chapter  2 


Background  and  Related  Work 

2.1  Background 

In this section, we summarize background on dataflow graph modeling, schedule
representation, and design tool development that we build on in this thesis.

2.1.1  Topological  Patterns 

For large-scale models of signal processing applications, the underlying
dataflow graph representations often consist of smaller substructures that
repeat multiple times. A method for scalable representation of dataflow graphs
using topological patterns was introduced in [23]. Topological patterns, such as
the ring, butterfly, and chain patterns, are pervasive in signal processing
applications, including multi-dimensional signal processing systems, where
processing of large-scale dataflow structures is common.

Topological patterns enable concise representation and direct analysis of
substructures in the context of high-level DSP specification languages and
design tools. Modeling based on topological patterns also provides a scalable
approach to specifying regular functional structures that is formally integrated
with the framework of dataflow. This integration allows not only for
specification of functional patterns, but also for their analysis and
optimization as part of the larger framework of dataflow.


For  more  details  on  modeling  and  design  based  on  topological  patterns,  we 
refer  the  reader  to  [23]. 

2.1.2  Generalized  Schedule  Trees 

The generalized schedule tree (GST) is a compact, tree-structured graphical
format that can represent a variety of dataflow graph schedules [13]. In GSTs,
each leaf node refers to an actor invocation, and each internal node $n$ (called
a loop node) is configured with an iteration count $I_n$ for the associated
subtree, where execution of the subtree rooted at $n$ is repeated $I_n$ times.

The GST has been demonstrated to represent looped schedules for dataflow graphs
effectively in the context of static, non-scalable schedules (e.g., see [13]).
In this thesis, we go significantly beyond the capabilities of GSTs by
formulating and implementing a novel schedule tree model for representing
scalable schedules (i.e., schedules that symbolically accommodate variations in
the numbers of actors and edges in the associated dataflow graphs). We refer to
this new form of schedule tree as the scalable schedule tree (SST) model. We
demonstrate the utility of SSTs in the synthesis of software from topological
pattern models of signal processing applications.

2.1.3 The Dataflow Interchange Format


The Dataflow Interchange Format (DIF) framework provides a standard language,
The DIF Language (TDL), for specifying the semantics of a broad class of
dataflow models of computation for signal processing applications [9]. Forms of
dataflow semantics that can be expressed using TDL include graph topologies,
hierarchical design structures, dataflow-related design properties (e.g.,
delays, production rates, and consumption rates), and actor-specific
information. The associated software package in the DIF framework, called The
DIF Package (TDP), provides intermediate representations for dataflow graphs
that are specified in TDL, along with libraries of analysis techniques and
transformations that operate on these representations. The analysis techniques
can be used to enhance dataflow-based design flows based on TDL or, through
generalized interchange capabilities provided by DIF, based on other dataflow
environments that are interfaced to DIF (e.g., see [11, 8, 24]).

In this thesis, we demonstrate the implementation and integration into the DIF
framework of 1) topological patterns for representing large-scale dataflow
graphs using TDL, and 2) SST representations for modeling and manipulation of
parameterized schedules based on topological patterns. Our implementation of the
SST representation is integrated with the Targeted Dataflow Interchange Format
(TDIF) environment for generating platform-specific code from DIF models [24].

More specifically, this thesis builds on the developments of [23, 9, 24] by
introducing the SST model described above for expressing parameterized schedules
based on topological patterns. We also introduce a new syntax for TDL that
provides a compact way to specify topological patterns. Furthermore, we have
developed a novel plug-in to TDP for generating code from our new schedule
model, and thereby deriving platform-specific code from TDL programs that
include specifications of topological patterns. Designers can use this tool to
experiment with parameterized scheduling and automated synthesis of
implementations from scalable dataflow graph models.

2.2  Related  Work 

Parameterized schedules have been studied before (e.g., see [1, 13]). In
previous work, production and consumption rates were the key dataflow graph
aspects that were used to generate parameterized schedules. With topological
patterns, even if production and consumption rates are fixed, the schedule is
still scalable in terms of the numbers of actors and edges. Such scalability,
when formulated in terms of topological patterns, leads to new opportunities and
constraints for developing parameterized scheduling techniques.

Early work on parameterized scheduling for dataflow graphs was done in the
context of parameterized dataflow representations. Parameterized dataflow is a
meta-modeling technique that can be applied to any underlying “base” dataflow
model, such as SDF [15], CSDF [3], FRDF [18], and BDF [4], for dynamically
reconfiguring the behavior of dataflow actors, edges, subsystems, and graphs
through dynamic reconfiguration of parameter values [1]. Quasi-static scheduling
techniques were developed for parameterized synchronous dataflow (PSDF)
specifications, where PSDF is the integration of the parameterized dataflow
meta-model with SDF as the base model. By quasi-static scheduling, we mean the
derivation of schedule structures that are largely fixed at compile time, with
relatively small numbers of decision points or symbolic adjustments made at
run-time based on the values of relevant input data. This approach to PSDF
scheduling improved the flexibility with which SDF techniques can be applied,
and allowed, for example, dynamic adjustments to schedules as dataflow (token
production and consumption) rates vary at run-time. However, in this work,
parameterized scheduling for scalable topologies was not addressed: the
underlying sets of actors and edges were assumed to be fixed.

The reactive process networks (RPN) model of computation supports the
construction of analysis and synthesis tools for dynamic streaming multimedia
applications that include both event-based and dataflow-based computations [7].
RPN provides an integration framework with run-time reconfiguration for event
and stream processing that is flexible enough to handle run-time scheduling
decisions and may also be used to represent non-deterministic stream processing
behaviors.

Using the parameterized Kahn process network (PKPN) model, designers can analyze
the behavior of a parameterized system at run-time based on self-timed
scheduling without introducing non-deterministic behaviors [17]. PKPN also
automates the design process through integration into the Compaan/Laura tool
[25].

The operational semantics of the RPN and PKPN models can be viewed as extensions
of the Kahn process network (KPN) modeling framework [10], where processes
execute concurrently, applying blocking reads to assess availability of data on
their inputs, and control is incorporated into processes in a distributed
fashion without use of a global scheduler. While these models lead to flexible
and efficient execution of KPN-related models, they, like the parameterized
dataflow framework, do not address the scheduling of scalable topologies.

In summary, our work addresses the issues of parameterized scheduling for
scalable topologies, and introduces a novel schedule model that provides for
intuitive representation and efficient code generation for our targeted class of
parameterized schedules. Adapting the parameterized scheduling models and
methods from this work to the frameworks of parameterized dataflow, RPN, and
PKPN is an interesting direction for further study.

Chapter 3


Scalable  Schedule  Trees 

In this chapter, we build on the GST representation, and develop a new formal
method to formulate and represent a class of parameterized schedules. This
targeted class of schedules is useful for implementing dataflow graph models
that employ topological patterns, as we demonstrate in subsequent chapters of
this thesis. Our new model for schedule representation is significantly more
powerful than the original GST formulation, and as a target for scheduling
techniques, this new model enables the development of correspondingly more
powerful schedulers.

A scalable schedule tree (SST) has all of the features of a GST (see
Section 2.1.2), and additionally provides the following new features.

1. Parameterization. An SST has an associated parameter set $K$. Nodes within
the schedule tree can be parameterized in terms of this parameter set (we will
describe this in more detail below). The semantics of how SST parameters (i.e.,
values associated with elements of $K$) change is not specified in the SST
model; rather, it is determined by the model of computation that is used for
application specification (e.g., SDF with static graph parameters [14],
parameterized dataflow [1], or scenario-aware dataflow [26]), in conjunction
with the scheduling strategy that is used to derive the schedule tree. This
decoupling from parameter change semantics allows the SST model to be applied to
a variety of different kinds of dataflow application models and design
environments.

2. Guarded execution. An SST leaf node, which encapsulates a firing of an
individual actor, has an optional guarded attribute, which indicates that firing
of the corresponding actor should be preceded by a run-time fireability
(enabling) check. Such an enabling check determines whether or not sufficient
input data is available for the actor to fire. If sufficient input data is not
available, the firing is aborted; that is, the corresponding actor is
effectively “skipped” during the current visitation of the leaf node. The
guarded attribute of SSTs is motivated by the enable-invoke dataflow model of
computation, where guarded executions play a fundamental role [20].

3. Dynamic iteration counts. Loop nodes can be dynamically parameterized in
terms of SST parameters, which provides capabilities for data- or mode-dependent
iteration in schedules. An SST loop node $L$ can be viewed as a parameterizable
form of the constant-iteration-count loop nodes in GSTs. An SST loop node $L$
has an associated iteration count evaluation function
$c_L : K \to \mathbb{Z}^{+}$. An implementation of $c_L$ takes as arguments zero
or more of the parameters in $K$, and returns a non-negative integer (zero
parameters are used if the iteration count is constant). Visitation of $L$
begins by calling $c_L$ to determine the iteration count, and then executing the
subtree of $L$ successively a number of times equal to this count.

4. Arrayed children. In addition to leaf nodes and SST loop nodes, there is a
third kind of node, an internal node called an arrayed children node (ACN). ACNs
are perhaps the most distinctive aspect of SSTs, and the most closely related to
topological patterns. These are discussed in more detail in the following
subsection.
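
The guarded-execution and dynamic-iteration-count features listed above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class and method names (SstFeaturesSketch, Leaf, LoopNode, demo) are hypothetical and do not correspond to the DIF plug-in API, the parameter set K is modeled as a map from names to integers, and a firing is simply recorded in a trace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BooleanSupplier;
import java.util.function.ToIntFunction;

// Illustrative only: hypothetical names, not the DIF plug-in API.
final class SstFeaturesSketch {
    /** A leaf that optionally performs an enabling check before firing. */
    static final class Leaf {
        final String actor;
        final boolean guarded;
        final BooleanSupplier enabled; // run-time fireability check

        Leaf(String actor, boolean guarded, BooleanSupplier enabled) {
            this.actor = actor;
            this.guarded = guarded;
            this.enabled = enabled;
        }

        void visit(List<String> trace) {
            if (guarded && !enabled.getAsBoolean()) {
                return; // actor is skipped during this visitation
            }
            trace.add(actor); // stands in for firing the actor
        }
    }

    /** A loop node whose iteration count is computed from the parameter set. */
    static final class LoopNode {
        final ToIntFunction<Map<String, Integer>> count; // plays the role of c_L
        final Leaf child;

        LoopNode(ToIntFunction<Map<String, Integer>> count, Leaf child) {
            this.count = count;
            this.child = child;
        }

        void visit(Map<String, Integer> k, List<String> trace) {
            int iterations = count.applyAsInt(k); // evaluated once per visitation
            for (int i = 0; i < iterations; i++) {
                child.visit(trace);
            }
        }
    }

    /** Visits a loop node with count N over a guarded leaf for actor "A". */
    static List<String> demo(int n, boolean inputAvailable) {
        Leaf leaf = new Leaf("A", true, () -> inputAvailable);
        LoopNode loop = new LoopNode(k -> k.get("N"), leaf);
        List<String> trace = new ArrayList<>();
        loop.visit(Map.of("N", n), trace);
        return trace;
    }
}
```

In this sketch the iteration count is read from K at the start of each visitation, so changing the parameter value between visitations changes the number of firings without rebuilding the tree.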

3.1 Arrayed Children Nodes


An ACN $z$ has an associated parameter set $P_z$. Each $p \in P_z$ in turn has
an associated evaluation function $f_p : K \to u_p$, where $u_p$ is the set of
admissible values (parameter domain) of $p$, and again, $K$ is the parameter set
of the associated schedule tree.

An ACN $z$ has an associated array $\mathit{children}_z$, which represents an
ordered list of candidate children nodes during any execution of the SST subtree
rooted at $z$. For simplicity, we assume that $\mathit{children}_z$ is a
one-dimensional array, but the associated formulations can easily be extended to
handle multi-dimensional arrays of candidate children. The array
$\mathit{children}_z$ has a positive integer size $\mathit{size}_z$, which gives
the number of elements in the array. It is assumed that the array is indexed
starting at 0.

Each element in $\mathit{children}_z$ represents a schedule tree leaf node
(i.e., an encapsulation of an actor in the enclosing dataflow graph), an SST
loop node, or another SST (i.e., a “nested” SST). An ACN $z$ also has three
functions associated with it, which we denote as $\mathit{cinit}_z$,
$\mathit{cstep}_z$, and $\mathit{climit}_z$, that determine how
$\mathit{children}_z$ is traversed during a given execution of the enclosing
subtree. These functions take as arguments pre-specified subsets of the
parameters of $z$, and return, respectively, a non-negative, positive, and
non-negative integer. One or more of these functions can be constant-valued;
dependence on parameter settings is not essential but rather a feature that is
provided for enhanced flexibility.

3.2 SST Traversal Process


When an ACN $z$ is visited during traversal (execution) of the enclosing
schedule tree, the following sequence of steps, called the SST traversal
process, is carried out.


(1) The parameter settings for $z$ are updated by applying the evaluation
function $f_p$ for each parameter $p \in P_z$.

(2) The values of $\mathit{cinit}_z$, $\mathit{cstep}_z$, and
$\mathit{climit}_z$ are evaluated in terms of the updated parameter settings.
These values are stored in temporary variables, which we denote as $I$, $s$, and
$L$, respectively.

(3) The computation outlined by the pseudocode shown in Algorithm 1 is carried
out, where A represents the array $\mathit{children}_z$; count represents the
iteration count evaluation function of the associated SST loop node; and K
represents the set of parameters for the enclosing SST.


Algorithm  1  Outline  of  the  SST  traversal  process. 


for (i = I; i <= L; i += s) {
    if A[i] is a leaf node {
        execute the actor encapsulated by A[i]
    } else if A[i] is an SST loop node {
        Z = count(K)
        execute the loop node subtree Z times
    } else {  // A[i] is a nested SST
        recursively apply the SST traversal process to A[i]
    }
}


A generalization of SSTs can be envisioned in which arrays of candidate children
are replaced by lists, and the visitation process for a generalized ACN $g$
starts by applying a function $g$, which takes parameter settings for $P_g$ as
arguments, and returns a list of children in the order that they should be
visited. Exploring such generalized SSTs for more complex schedule control is an
interesting direction for further study.

Figure 3.1: An example of an SST.

Figure 3.1 shows a synthetic example of a nested SST to help illustrate the SST
model. In Figure 3.1, A_ACN and B_ACN are ACNs. Suppose that the evaluation
results of cinit, cstep, and climit for A_ACN and B_ACN are
$\mathit{cinit}_A = 0$, $\mathit{cstep}_A = 1$, $\mathit{climit}_A = 1$,
$\mathit{cinit}_B = 1$, $\mathit{cstep}_B = 2$, and $\mathit{climit}_B = 4$.
Then the schedule that results from traversing the SST by following the SST
traversal process is

    S = B1 B3 A A A B1 B3 A A A.

This traversed schedule S is the sequence of actor executions that results from
traversing the given SST.
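
The traversal process can be checked against this example with a small Java sketch. Several things here are assumptions rather than facts from the text: the tree shape is inferred from the schedule S and from the construction steps in Section 4.2 (a root loop node with count 2, whose child A_ACN holds B_ACN and a count-3 loop over actor A), the five children of B_ACN are named B0 to B4, the cinit, cstep, and climit values are passed as already-evaluated integers, and all identifiers are hypothetical rather than part of the DIF plug-in.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the SST traversal process.
// Hypothetical names; the tree shape is an assumption inferred from S.
final class SstTraversalSketch {
    /** Any schedule tree node: visiting it appends actor firings to the trace. */
    interface Node {
        void visit(List<String> trace);
    }

    /** Leaf node: one firing of the named actor. */
    static Node leaf(String actor) {
        return trace -> trace.add(actor);
    }

    /** Loop node: executes its subtree `count` times (constant count here). */
    static Node loop(int count, Node... children) {
        return trace -> {
            for (int r = 0; r < count; r++) {
                for (Node child : children) {
                    child.visit(trace);
                }
            }
        };
    }

    /** ACN: visits children[i] for i = cinit, cinit + cstep, ... while i <= climit. */
    static Node acn(int cinit, int cstep, int climit, Node... children) {
        return trace -> {
            for (int i = cinit; i <= climit; i += cstep) {
                children[i].visit(trace);
            }
        };
    }

    /** Builds the assumed tree for the example and returns the traversed schedule. */
    static String figureExample() {
        Node bAcn = acn(1, 2, 4,
                leaf("B0"), leaf("B1"), leaf("B2"), leaf("B3"), leaf("B4"));
        Node aAcn = acn(0, 1, 1, bAcn, loop(3, leaf("A")));
        Node root = loop(2, aAcn);
        List<String> trace = new ArrayList<>();
        root.visit(trace);
        return String.join(" ", trace);
    }
}
```

Under these assumptions, figureExample() returns the string "B1 B3 A A A B1 B3 A A A", which matches S. Note that the loop bound is inclusive (i <= L), as in Algorithm 1, which is why B_ACN with climit 4 and cstep 2 starting at index 1 selects only B1 and B3.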

3.3  Relationship  to  Scalable  Dataflow 

The form of scalability provided by SSTs, which can be viewed as topological
scalability, is orthogonal to that provided by the scalable dataflow concept
introduced by Ritz, Pankert, and Meyr [21]. The two techniques can be applied
independently or jointly. In scalable dataflow, the objective is to execute
block-processing versions of actors. Each scalable dataflow actor is programmed
in terms of a vectorization degree $N$, which represents the number of firings
of the actor that are executed together. This allows such an actor to process
data in blocks of $N$ units, and furthermore to carry out internal computations
in such a block-processed way, which can provide significantly increased
throughput and data locality, possibly at the expense of latency and buffer
memory requirements [22, 12].

While Ritz et al. present scalable dataflow in the context of SDF, referring to
the model as scalable SDF or SSDF, the underlying form of scalability is more
general and can be applied to arbitrary application programming interfaces
(APIs) or software synthesis frameworks for signal processing dataflow graphs.
This form of vectorization-oriented scalability can be applied in conjunction
with SSTs by having leaf nodes represent vectorized executions of the
corresponding actors. Note that constructing an ACN with size equal to the
vectorization degree $N$ will not have an equivalent effect, because control
will alternate between each (scalar) actor firing and the control associated
with ACN visitation instead of remaining dedicated to the actor for $N$
consecutive firings, as scalable dataflow ensures.

Vectorization (scalable dataflow) can be applied flexibly within SSTs. For
example, an SST loop node $L$ can be connected as an element of
$\mathit{children}_a$, where $a$ is an ACN, and $L$ contains as its single child
the actor $A$ that is to be vectorized. The loop count associated with $L$ can
then be passed dynamically to a vectorized implementation of $A$ to execute $A$
in a block-processing fashion.
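
The difference between scalar firings driven by the schedule and a single vectorized firing can be sketched as follows. This is a toy illustration with hypothetical names (VectorizationSketch, fire, fireBlock); the "actor" simply doubles each sample, and a counter stands in for transfers of control between the schedule and the actor.

```java
// Illustrative only: hypothetical names; the actor just doubles each sample.
final class VectorizationSketch {
    /** Counts how many times control passes from the schedule to the actor. */
    static int controlTransfers = 0;

    /** Scalar firing: one sample per invocation. */
    static int fire(int x) {
        controlTransfers++;
        return 2 * x;
    }

    /** Vectorized firing: n consecutive firings executed in one invocation. */
    static int[] fireBlock(int[] in, int n) {
        controlTransfers++;
        int[] out = new int[n];
        for (int k = 0; k < n; k++) {
            out[k] = 2 * in[k];
        }
        return out;
    }

    /** Schedule-driven iteration: control returns to the schedule after each firing. */
    static int[] viaScalarFirings(int[] in) {
        int[] out = new int[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = fire(in[i]);
        }
        return out;
    }

    /** Loop node whose count is handed to the vectorized actor as its block size. */
    static int[] viaLoopNode(int[] in, int loopCount) {
        return fireBlock(in, loopCount);
    }
}
```

Both paths compute the same output, but the vectorized path enters the actor once per block rather than once per sample, which is the property that an ACN of size N alone does not provide.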

Chapter 4

Integration  in  the  DIF  Framework 

In this chapter, we discuss our approach to integrating topological pattern
modeling and SST-based schedule representation and analysis into the DIF
framework and the associated TDP software package. This integration provides new
capabilities for design and implementation of multimedia signal processing
systems that employ repetitive graph structures.

4.1  Language  Extensions 

TDL in the DIF framework is a vendor-independent textual language that helps to
transfer dataflow-based application models across different design tools, and
also serves as a standalone language for specifying such models. TDL along with
TDP captures essential dataflow modeling information and stores this information
within intermediate representations, which can be analyzed and mapped into
implementations on different platforms.

We have extended TDL to incorporate support for topological patterns. This
extension allows topological pattern constructions to be specified as
first-class citizens in the language. The parser of TDL is generated by using
the SableCC compiler construction framework [6]. We have extended the TDL
grammar for SableCC by defining syntactic constructs and associated parsing
actions for topological pattern instantiations in TDL. Topological patterns that
are currently supported by TDL and defined as pattern keywords in the language
include chain, ring, merge, broadcast, parallel, and butterfly. Extending TDL
with additional patterns is straightforward, and such extensions will be
considered in future versions of the language as additional kinds of patterns
are identified as being important in the context of relevant applications.

A  topological  pattern  is  instantiated  in  a  TDL  specification  with  a  declaration 
of  the  form: 

    <edge declaration> -> <pattern keyword>(<node list>);

Here, <edge declaration> effectively declares a set of new edges in the graph
that is being constructed. These edges can be defined as scalar edges (e.g., e1,
e2, ...) or in the form of an array (e.g., e1[2]). The placeholder
<pattern keyword> represents a TDL keyword that specifies the kind of
topological pattern that is being instantiated (e.g., chain, ring, etc.). The
placeholder <node list> provides a list of nodes (graph placeholders for
dataflow actors) that have been previously instantiated. The topological pattern
instantiation construct instantiates the newly defined edges (i.e., the edges
listed in <edge declaration>) in such a way that they connect pairs of nodes in
<node list>, so that the resulting combination of the nodes in <node list> and
the edges in <edge declaration> forms the specified type of topological pattern.
The orderings of edges in <edge declaration> and nodes in <node list> are
significant in determining how specific nodes and edges are linked to form a new
instance of the specified pattern.

As a simple example, an instance of a chain pattern can be specified using the
new TDL syntax as follows.

    e[3] -> chain(n1[0:3]);

Here, a chain pattern is created by linking four nodes, n1[0], n1[1], n1[2], and
n1[3], with the three newly instantiated edges e[0], e[1], and e[2]. Figure 4.1
shows instantiation examples for all of the patterns we have supported so far
and their corresponding TDL specifications.
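
To make the effect of a pattern instantiation concrete, the following Java sketch expands a chain, and a ring, over n nodes into explicit edges. This is not how TDP implements pattern expansion; the class and method names are hypothetical, edges are written as "source->sink" index pairs, and the ring is assumed to be a chain plus one closing edge from the last node back to the first, with that closing edge listed last.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: hypothetical expansion of pattern instances into edges.
final class PatternSketch {
    /** chain over n nodes: n - 1 edges, where edge i connects node i to node i + 1. */
    static List<String> chain(int n) {
        List<String> edges = new ArrayList<>();
        for (int i = 0; i + 1 < n; i++) {
            edges.add(i + "->" + (i + 1));
        }
        return edges;
    }

    /** ring over n nodes (assumed form): the chain edges plus a closing edge. */
    static List<String> ring(int n) {
        List<String> edges = chain(n);
        if (n > 1) {
            edges.add((n - 1) + "->0");
        }
        return edges;
    }
}
```

For the chain example above, chain(4) yields the three edges 0->1, 1->2, and 2->3, corresponding to e[0], e[1], and e[2] linking the four nodes.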

4.2  SST  Plug-In  for  the  DIF  Package 

We have implemented a new plug-in to the DIF framework that allows designers to
construct SSTs for schedules associated with dataflow graphs that are specified
in TDL, and that employ arbitrary numbers of topological pattern instantiations.
This plug-in integrates the SST formulations developed in Chapter 3 as a new
internal representation format and associated set of manipulations within the
DIF framework.

This plug-in is built on two Java classes in DIF called ScalableScheduleTree and
ScalableScheduleTreeNode. An object that is instantiated from the
ScalableScheduleTreeNode class can be in the form of either a leaf node or an
internal node, where the internal node can be configured with an iteration count
or specified as an ACN. A leaf node instance is associated with an actor from
the original dataflow graph. An ACN instance has private fields that store the
values of $\mathit{cinit}_z$, $\mathit{cstep}_z$, and $\mathit{climit}_z$, as
defined in Chapter 3. The ScalableScheduleTree class provides methods that allow
designers to construct SSTs.

The  following  examples  illustrate  how  the  SST  that  is  shown  in  Fig.  3.1  can 
be  constructed  using  the  proposed  tools. 

Figure 4.1: Example of topological pattern instantiations shown in terms of TDL
code, and illustrations of the resulting edge instantiations together with their
incident nodes. The TDL code recovered from the figure panels is
e[3] -> chain(n1[0:3]) and e[8] -> butterfly(n1[0:3], n2[0:3]).

Construction of a new SST subtree (sst1) is illustrated as follows. This SST is
rooted at a node that is instantiated from the ScalableScheduleTreeNode class
and configured with an iteration count value of 2. An ACN (labeled as A_ACN) is
also instantiated without any child node and added as a child node of the root
node of sst1.

    ScalableScheduleTree sst1 = new ScalableScheduleTree();
    sst1.addACN("A", 0, 2);

Construction of another SST subtree (sst2) is illustrated as follows. This SST
is rooted at an ACN (labeled as B_ACN) that is instantiated with 5 child nodes
from the ScalableScheduleTreeNode class, and added as a child node of the root
node of sst2.

    ScalableScheduleTree sst2 = new ScalableScheduleTree();
    sst2.addACN("B", 5);

Construction of a third SST subtree (sst3) is shown as follows. This SST is
rooted at a node that is configured with an iteration count value of 3. A leaf
node is added as the child node of the root node of sst3.

    ScalableScheduleTree sst3 = new ScalableScheduleTree();
    sst3.addSchedule("A", 3);

To create the SST that is shown in Fig. 3.1, designers can insert the sst2
subtree as the last child of the root node of the sst1 subtree by using the
method below. The same method can then be applied to insert the sst3 subtree as
the last child of the root node of the sst1 subtree.

    sst1.insertSchedule(sst2);
    sst1.insertSchedule(sst3);
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4.3  Code  Generation  and  TDIF  Version  0.2 


In the first version of the targeted DIF (TDIF) environment, TDIF Version 0.1,
designers construct schedules based on programming interfaces that are
automatically generated from the TDIF tool [24]. These programming interfaces
provide a consistent, formal dataflow abstraction layer between
designer-constructed schedules and the actors that are executed by the schedules.
Furthermore, the approach of automatically generating actor programming interfaces
from target-independent actor interface specifications (in the TDIF language)
allows the framework to be adapted efficiently to different target languages
(presently, the environment supports both C and CUDA).

Although the designer-specified scheduling approach of TDIF 0.1 generally
requires more effort than the use of automatically generated schedules, it
provides significant flexibility in terms of optimizing and fine-tuning the
schedules based on specialized application constraints and objectives, and
incorporating application-specific insights on schedule structure that may not be
exploited by available techniques for automated scheduling.

In the new version of TDIF, which we introduce here as Version 0.2, we
have integrated specification and code generation support for SSTs. Thus, rather
than having designers specify schedules in terms of arbitrary target-language code
that connects to TDIF-generated actor interfaces (as in TDIF 0.1), we raise the
level of abstraction for schedule specification by allowing SST-based specification of
schedules, where leaf nodes in the schedule trees are connected to the same
TDIF-generated interfaces. SSTs are specified programmatically using graph
construction APIs associated with the SST internal representation. Incorporating
such specifications into TDL is a natural direction for future work that we plan to
explore.

Code  generation  in  TDIF  for  an  SST  is  carried  out  by  applying  depth  first 
search  to  traverse  the  schedule  tree,  and  invoking  a  specialized  code  generation 
module  in  each  visitation  step  depending  on  the  kind  of  node  that  is  visited  (leaf 
node,  SST  loop  node,  or  ACN).  The  code  generated  from  an  SST,  which  implements 
the  scheduler  for  the  given  application,  can  be  linked  together  with  a  top-level  C 
file  that  is  automatically  generated  from  the  TDIF  environment,  and  actor  code 
from  the  associated  actor  library  to  construct  an  executable  that  implements  the 
application. 

Algorithm 2 shows a pseudocode description of the SST traversal process for
generating C code from the TDIF environment. An example of a generated code
segment that implements a scheduler is shown in Section 5.

Figure 4.2 illustrates the design flow of TDIF Version 0.2, which incorporates
specification and code generation support for SSTs and parameterized scheduling.

Algorithm 2 Pseudocode description of the SST traversal process for code
generation.

x is the root node of a given SST

function SSTTraversalProcess(x)
    if (x is a leaf node) begin
        Generate C code for the TDIFC run-time APIs
        for the actor encapsulation.
    end
    else begin
        if (x is an ACN) begin
            Update the parameter settings for x.
            Evaluate cinit, cstep, and climit of x
            and store values in I, s, L, respectively.
            for (i = I; i <= L; i += s) begin
                Get y as the array of children of x
                SSTTraversalProcess(y[i])
            end
        end
        else begin // x is an SST loop node
            Evaluate loop count of x.
            Generate C code for the loop structure of x
            for each child node z of x
                SSTTraversalProcess(z)
        end
    end
end function
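
As a rough illustration of Algorithm 2, the following C sketch walks a small
in-memory tree and appends scheduler text to a buffer. The SstNode layout, the
SstKind names, and the sst_traverse and append helpers are assumptions made for
this sketch only (the SST classes described above are Java classes); the emitted
text follows the tdifc_ec_enable_check / tdifc_ec_invoke pattern that appears in
the generated code of Section 5.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical node kinds and layout, introduced only for this sketch. */
typedef enum { SST_LEAF, SST_LOOP, SST_ACN } SstKind;

typedef struct SstNode {
    SstKind kind;
    const char *actor;          /* leaf only: actor name, e.g. "g0" */
    int loop_count;             /* loop node only */
    int cinit, cstep, climit;   /* ACN only: range of child indices to visit */
    struct SstNode **children;
    int num_children;
} SstNode;

/* Append text to out (capacity cap); excess text is silently dropped. */
static void append(char *out, size_t cap, const char *text)
{
    size_t used = strlen(out);
    if (used + 1 < cap) {
        strncat(out, text, cap - used - 1);
    }
}

/* Depth-first traversal that mirrors the three cases of Algorithm 2. */
void sst_traverse(const SstNode *x, char *out, size_t cap)
{
    char line[256];
    int i;

    if (x->kind == SST_LEAF) {
        /* Leaf: emit the enable-check / invoke pair for the actor. */
        snprintf(line, sizeof line,
                 "if (tdifc_ec_enable_check(tdifc_lib_%s_ec, tdifc_lib_%s_tc)) {\n"
                 "    tdifc_ec_invoke(tdifc_lib_%s_ec, tdifc_lib_%s_tc);\n"
                 "}\n",
                 x->actor, x->actor, x->actor, x->actor);
        append(out, cap, line);
    } else if (x->kind == SST_ACN) {
        /* ACN: visit children cinit, cinit + cstep, ..., up to climit. */
        if (x->cstep <= 0 || x->cinit < 0) {
            return;  /* guard against a non-terminating or invalid range */
        }
        for (i = x->cinit; i <= x->climit && i < x->num_children; i += x->cstep) {
            sst_traverse(x->children[i], out, cap);
        }
    } else {
        /* Loop node: emit a C loop around the code of all children. */
        snprintf(line, sizeof line,
                 "for (int k = 0; k < %d; k++) {\n", x->loop_count);
        append(out, cap, line);
        for (i = 0; i < x->num_children; i++) {
            sst_traverse(x->children[i], out, cap);
        }
        append(out, cap, "}\n");
    }
}
```

In this sketch the ACN case generates no loop of its own: it only selects which
children are expanded, so the emitted code grows with the number of selected
children, while a loop node emits a single loop regardless of its count.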
Figure 4.2: TDIF design flow.

Chapter 5

Case  Study:  Image  Registration  Application 

To demonstrate our methods for representation of and code generation from
schedules for dataflow graphs that employ topological patterns, and to demonstrate
also the capabilities of our associated new plug-in to the DIF framework, we
developed an image registration application based on the Scale-Invariant Feature
Transform (SIFT) algorithm as a case study [16]. SIFT is a well-known algorithm in
computer vision for detecting features in images and matching them across images.

5.1  Application  Overview 

Image  registration  is  a  process  of  geometrically  aligning  two  or  more  images  of 
the  same  scene  so  that  they  can  be  overlaid  [28].  Here,  one  of  the  images  is  referred  to 
as  the  reference  image  and  the  second  image  is  referred  to  as  the  target  image. 
Figure  5.1  shows  the  design  flow  of  our  proposed  image  registration  system  in  terms 
of  a  dataflow  graph. 

5.1.1  Scale-Invariant  Feature  Transform 

The SIFT algorithm provides a method to extract distinctive scale- and
rotation-invariant features from images. SIFT can be used to perform feature
matching between images that are taken from different views of the same scene. The
dataflow graph in Figure 5.1 for the SIFT algorithm consists of five actors. These
are actors for Cascade Gaussian Filtering, Difference of Gaussian computation,
Local Extrema Detection, Post Processing, and Descriptor Assignment.

Figure 5.1: A dataflow graph model of the image registration application.


The Cascade Gaussian Filtering actor implements a cascade Gaussian filtering
subsystem, which contains a number of Gaussian filters with different standard
deviations. These filters produce a series of Gaussian filtered images.
Neighboring images that are filtered by Cascade Gaussian Filtering (e.g., see
Figure 5.2) are subtracted by the Difference of Gaussian actor to produce a series
of difference of Gaussian images. Then the Local Extrema Detection actor selects
the maxima and minima of the difference of Gaussian images as key point
candidates. A pixel is selected as a candidate only if its value is larger than, or
smaller than, the values of all of its 26 neighbors (8 neighboring pixels in the
enclosing image and 9 neighboring pixels in each of the two adjacent images, for
18 in total).
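
The 26-neighbor test can be sketched in C as follows. This is an illustration
under stated assumptions rather than the code of the Local Extrema Detection
actor: the function name is_local_extremum, the row-major float layout of the
three difference of Gaussian images, and the choice to reject ties (a neighbor
equal to the center value) are all assumptions made here.

```c
#include <stddef.h>

/* Returns 1 if pixel (x, y) of the middle image is strictly larger than, or
   strictly smaller than, all 26 neighbors: 8 in the middle image and 9 in each
   of the images below and above. All three images are row-major arrays with
   the same width. The caller must keep (x, y) at least one pixel away from
   the image border. */
int is_local_extremum(const float *below, const float *mid, const float *above,
                      int width, int x, int y)
{
    const float *layers[3];
    float v = mid[(size_t)y * width + x];
    int is_max = 1, is_min = 1;
    int l, dy, dx;

    layers[0] = below;
    layers[1] = mid;
    layers[2] = above;

    for (l = 0; l < 3; l++) {
        for (dy = -1; dy <= 1; dy++) {
            for (dx = -1; dx <= 1; dx++) {
                float n;
                if (l == 1 && dy == 0 && dx == 0) {
                    continue;  /* skip the center pixel itself */
                }
                n = layers[l][(size_t)(y + dy) * width + (x + dx)];
                if (n >= v) is_max = 0;
                if (n <= v) is_min = 0;
            }
        }
    }
    return is_max || is_min;
}
```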

The Post Processing actor eliminates key points that are localized near the
boundary of the image or localized along line segments or curves across which there
are large gradients in pixel intensity. An orientation is also assigned to each key
point. Finally, image gradient information near the key points is extracted and
stored as key point descriptors by the Descriptor Assignment actor. We ported
MATLAB implementations of the SIFT algorithm [27] to the dataflow actors and
implemented them using C and CUDA.

Figure 5.2: Cascade Gaussian filtering and the process for generating differences of
Gaussian images (panels: original image, first octave, next octave; legend:
G = Gaussian filter, S = subtractor, D = downsampler).

5.1.2  Key  Points  Matching 


When performing feature matching between two images, key point i in an
image A is matched to key point j in another image B only if the Euclidean distance
between i's descriptor and j's descriptor, multiplied by a user-defined threshold,
is not greater than the Euclidean distance from i's descriptor to each of the
other key point descriptors.
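
The matching rule can be sketched in C as shown below. The function name
match_key_point, the flat row-major descriptor layout, and the interpretation
that "the other key point descriptors" are the remaining descriptors of image B
are assumptions of this sketch. Squared distances are compared (with the
threshold squared accordingly), which gives the same decision as comparing
Euclidean distances for a non-negative threshold.

```c
#include <stddef.h>
#include <float.h>

/* Returns the index j of the descriptor in image B that key point i matches,
   or -1 if no descriptor passes the test. descs_b holds num_b descriptors of
   dim values each, stored one after another. */
int match_key_point(const float *desc_i, const float *descs_b,
                    int num_b, int dim, double threshold)
{
    double best = DBL_MAX, second = DBL_MAX;  /* two smallest squared distances */
    int best_j = -1;
    int j, k;

    for (j = 0; j < num_b; j++) {
        double d2 = 0.0;
        for (k = 0; k < dim; k++) {
            double diff = (double)desc_i[k] - (double)descs_b[(size_t)j * dim + k];
            d2 += diff * diff;
        }
        if (d2 < best) {
            second = best;
            best = d2;
            best_j = j;
        } else if (d2 < second) {
            second = d2;
        }
    }
    if (best_j < 0) {
        return -1;  /* image B has no descriptors */
    }
    /* dist(i, j) * threshold <= dist(i, k) for every other k holds exactly
       when it holds for the second-nearest descriptor. */
    return (best * threshold * threshold <= second) ? best_j : -1;
}
```

Only the nearest and second-nearest descriptors need to be tracked, because the
second-nearest one is the tightest of the "all other" comparisons.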


5.1.3  Matching  Refinement 

Since key points matching may generate false matches between the reference
image and the target image, a refinement step is needed in order to eliminate these
false matches. For such matching refinement computation, we applied the RANdom
SAmple Consensus (RANSAC) algorithm [5]. RANSAC is an iterative method to
estimate parameters of a mathematical model from a set of observed data consisting
of both inliers and outliers. In our case, inliers are correct matches and outliers are
false matches.

The  pseudocode  shown  in  Algorithm  3  outlines  our  implementation  of  the 
RANSAC  algorithm.  Both  iteration  and  threshold  are  parameters  that  can 
be  configured  by  the  designer.  As  an  example,  Figure  5.4  shows  the  key  points 
matching  of  the  two  images  shown  in  Figure  5.3  before  running  RANSAC,  and 
Figure  5.5  shows  the  key  points  matching  after  running  RANSAC. 

5.1.4  Target  Image  Transformation 

As shown in Fig. 5.1, the Target Image Transformation actor performs the
computation of target image transformation by taking the outputs produced by the
SIFT computation, the refined matching result, and the target image, and producing
the resulting registered image. For the computation of target image
transformation, we use a rigid transformation process, which can be divided into
three steps: translation, rotation, and scaling (see Figure 5.6).

Algorithm 3 Outline of the RANSAC algorithm as we have applied it.

/* track the number of match pairs in the best inliers */
count_best = 0;

for (i = 0; i < iteration; i++) {
    /* select one match pair (K1, K2). (x1, y1)
       and (x2, y2) are coordinates of the two
       key points in terms of the pixel position */
    rand_match = randomly selected match pair;

    /* track the number of match pairs in the inliers */
    count = 0;

    for (j = 0; j < number_matches; j++) {
        /* (K3, K4) is jth match pair, (x3, y3)
           and (x4, y4) are their coordinates */
        deltax = (x1 - x2) - (x3 - x4);
        deltay = (y1 - y2) - (y3 - y4);

        /* error measures how likely this match pair is an inlier */
        error = pow(deltax, 2) + pow(deltay, 2);

        if (error < threshold) {
            add jth match pair into inliers;
            count++;
        }
    }

    if (count > count_best) {
        count_best = count;
        best_inliers = inliers;
    }
}
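
A compilable variant of the outline in Algorithm 3 might look like the sketch
below. The Match struct, the function name ransac_translation, the seed
parameter, and the representation of the inlier set as a flag array are
assumptions made for illustration; the error measure and the comparison against
threshold follow Algorithm 3.

```c
#include <stdlib.h>
#include <string.h>

/* One match pair: (x1, y1) in the reference image, (x2, y2) in the target. */
typedef struct { double x1, y1, x2, y2; } Match;

/* Fills best_inliers[j] with 1 for every match in the largest consensus set
   found and returns the size of that set (or -1 if allocation fails).
   best_inliers must have room for n entries. */
int ransac_translation(const Match *m, int n, int iterations,
                       double threshold, unsigned seed,
                       unsigned char *best_inliers)
{
    unsigned char *inliers;
    int count_best = 0;
    int i, j;

    if (n <= 0) {
        return 0;
    }
    inliers = malloc((size_t)n);
    if (inliers == NULL) {
        return -1;
    }
    memset(best_inliers, 0, (size_t)n);
    srand(seed);

    for (i = 0; i < iterations; i++) {
        /* Randomly selected reference pair for this iteration. */
        const Match *r = &m[rand() % n];
        int count = 0;

        for (j = 0; j < n; j++) {
            double dx = (r->x1 - r->x2) - (m[j].x1 - m[j].x2);
            double dy = (r->y1 - r->y2) - (m[j].y1 - m[j].y2);
            double error = dx * dx + dy * dy;  /* squared offset difference */

            inliers[j] = (unsigned char)(error < threshold);
            count += inliers[j];
        }
        if (count > count_best) {
            count_best = count;
            memcpy(best_inliers, inliers, (size_t)n);
        }
    }
    free(inliers);
    return count_best;
}
```

Because error is a squared pixel distance, threshold in this sketch is expressed
in squared pixels, as it is in the comparison of Algorithm 3.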
Figure 5.3: Original images for RANSAC example.

Figure 5.4: Key points matching before running RANSAC.

Figure 5.5: Key points matching after running RANSAC.

Figure 5.6: Steps in rigid target image transformation (panels: reference image,
target image).

1. In the translation step, we use one key points match pair from the Matching
Refinement step as the basis of translation. This pair is the randomly selected
match pair at the beginning of each iteration of the RANSAC algorithm. The
reason for using this pair as the basis of the target image transformation step is
that this match pair is unlikely to be a false match. Suppose the coordinates
of this match pair are (x1, y1) and (x2, y2). Then the translation vector is
(x1 - x2, y1 - y2).

2. The computation of the rotation step is carried out in the polar coordinate
system. The pole of this polar coordinate system is the coordinate of the key
point in the reference image of the match pair mentioned in our description of
the translation step. First, we convert the coordinates of matched key points
from the Cartesian coordinate system to the polar coordinate system. Then
we determine the rotation angle θ using the key point matching information.

3. The scaling step is also computed in the polar coordinate system. Since the
target image has been rotated, the two key points of each match pair should be
aligned with the pole. Therefore, the ratio of scaling is the ratio of the radii of
the two key points in a match pair.

Now that we have the translation vector, rotation angle, and scaling ratio, we
can use them to determine the corresponding position (in the target image) of each
pixel in the resulting image. These positions generally have fractional
coordinates. To reduce visual distortion, we use bilinear interpolation to
determine each pixel value in the resulting image by taking a weighted average of
the values of the four surrounding pixels in the target image.
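
Bilinear interpolation at a fractional position can be sketched in C as
follows. The function name bilinear_sample, the 8-bit gray-scale row-major image
layout, and the clamping of out-of-range positions to the image border are
assumptions of this sketch, not details taken from our implementation.

```c
#include <stddef.h>

/* Samples an 8-bit gray-scale, row-major image at the fractional position
   (x, y) as a weighted average of the four surrounding pixels. Positions
   outside the image are clamped to the nearest border pixel. */
double bilinear_sample(const unsigned char *img, int width, int height,
                       double x, double y)
{
    int x0, y0, x1, y1;
    double fx, fy, top, bottom;

    if (x < 0.0) x = 0.0;
    if (y < 0.0) y = 0.0;
    if (x > width - 1) x = width - 1;
    if (y > height - 1) y = height - 1;

    x0 = (int)x;                          /* floor, since x >= 0 here */
    y0 = (int)y;
    x1 = (x0 + 1 < width) ? x0 + 1 : x0;  /* right neighbor, clamped */
    y1 = (y0 + 1 < height) ? y0 + 1 : y0; /* lower neighbor, clamped */
    fx = x - x0;                          /* horizontal weight in [0, 1) */
    fy = y - y0;                          /* vertical weight in [0, 1) */

    top    = (1.0 - fx) * img[(size_t)y0 * width + x0]
           + fx * img[(size_t)y0 * width + x1];
    bottom = (1.0 - fx) * img[(size_t)y1 * width + x0]
           + fx * img[(size_t)y1 * width + x1];
    return (1.0 - fy) * top + fy * bottom;
}
```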


5.2  Applying  the  Scalable  Schedule  Tree 

Cascade  Gaussian  filtering  is  a  relevant  case  study  for  experimenting  with 
topological  patterns  and  SSTs  because  it  can  be  modeled  naturally  in  terms  of 
parameterized  topologies.  Here,  we  model  the  cascade  Gaussian  filtering  actor  as  a 
subsystem.  It  can  be  modeled  as  a  dataflow  graph  consisting  of  actors  that  perform 
Gaussian  filtering  and  downsampling  computations.  These  computations  can  be 
divided  into  a  set  of  o  groups,  such  that  each  group  involves  s  filtering  steps.  Both 
o  and  s  are  parameters  that  can  be  configured  by  the  designer  (e.g.,  to  explore 
trade-offs  between  processing  complexity  and  image  processing  accuracy). 

In  the  cascade  Gaussian  filtering  process  illustrated  in  Figure  5.2,  the  origi¬ 
nal  image  is  convolved  with  the  first  filter.  The  filtered  image  is  saved  and  then 
convolved  with  the  next  filter,  and  so  on.  After  one  group  of  filtering  operations  is 
carried  out,  s  different  blurred  Gaussian  images  are  labeled  as  a  separate  octave. 
The  next  step  is  to  downsample  the  last  image  of  the  previous  octave  by  a  factor 
of  two.  This  process,  as  shown  in  Figure  5.2,  repeats  until  o  octaves  of  images  are 
produced. 

The topological pattern underlying this subsystem with o = 6 and s = 6 is a
chain (linear arrangement of actors) that can be specified using the TDL code shown
in Program 1. Here, an array of 40 edges is instantiated by connecting 41 specified
nodes (six groups of six nodes each that are interleaved with five individual nodes)
in a chain.

Program 1 TDL code for cascade Gaussian filtering.

topology {
    nodes = G[36], D[5];
    edges = e[40] -> chain(G[0:5], D[0],
                           G[6:11], D[1],
                           G[12:17], D[2],
                           G[18:23], D[3],
                           G[24:29], D[4],
                           G[30:35]);
}
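
To make the dependence on o and s explicit, the following C sketch builds the
chain declaration for arbitrary values of the two parameters and returns the
number of edges. The helper name cascade_chain_tdl is hypothetical, and the
output only imitates the G[a:b] and D[k] notation of Program 1; it is not part
of TDL or of the TDL tools.

```c
#include <stdio.h>

/* Writes a chain declaration for o groups of s Gaussian filters interleaved
   with o - 1 downsamplers into out, and returns the number of chain edges.
   Returns -1 on invalid parameters or if out (capacity cap) is too small. */
int cascade_chain_tdl(int o, int s, char *out, size_t cap)
{
    int nodes, edges, g, n;
    size_t used;

    if (o < 1 || s < 1 || cap == 0) {
        return -1;
    }
    nodes = o * s + (o - 1);  /* filters plus downsamplers */
    edges = nodes - 1;        /* a chain of k nodes has k - 1 edges */

    n = snprintf(out, cap, "e[%d] -> chain(", edges);
    if (n < 0 || (size_t)n >= cap) {
        return -1;
    }
    used = (size_t)n;

    for (g = 0; g < o; g++) {
        if (g < o - 1) {
            n = snprintf(out + used, cap - used, "G[%d:%d], D[%d], ",
                         g * s, g * s + s - 1, g);
        } else {
            n = snprintf(out + used, cap - used, "G[%d:%d])",
                         g * s, g * s + s - 1);
        }
        if (n < 0 || (size_t)n >= cap - used) {
            return -1;
        }
        used += (size_t)n;
    }
    return edges;
}
```

For o = 6 and s = 6 the sketch reports 40 edges over 41 nodes, which agrees
with the counts stated above.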


Note  that  the  binding  of  nodes  to  specific  functions  is  done  in  a  separate  part 
of  the  TDL  specification  that  is  dedicated  to  assigning  actor  attributes.  This  part  of 
the  specification  is  not  shown  for  conciseness  (for  details,  we  refer  the  reader  to  [9]). 

In this example of cascade Gaussian filtering, since both o and s are
parameters that can be configured, one can naturally derive a nested SST as shown
in Figure 5.7. Such a representation provides a formal, target-language-independent
model of schedule structure that can be applied to coordinate execution for this
subsystem in a manner that is parameterized across two dimensions.

In the case that o = 6 and s = 6 (as shown in Figure 5.7), the cascade
Gaussian filter ACN has 11 child nodes, which include 6 nested ACNs, each
labeled as filter, and 5 downsampler actors encapsulated as leaf nodes, which are
labeled as D[0], D[1], ..., D[4]. Each of these leaf nodes represents an
encapsulation of a downsampler actor in the cascade Gaussian filtering application.
Each internal node labeled filter is an ACN that contains 6 child nodes, where each
of these child nodes represents an encapsulation of a Gaussian filtering actor
in the application. The Java code shown in Program 2 demonstrates how this SST
can be built by using the SST plug-in that is introduced in Section 4.

Figure 5.7: SST representation for the cascade Gaussian filtering application
(legend: ACN = Arrayed Children Node, G = Gaussian filter, D = downsampler; leaf
nodes G[0] to G[5], G[6] to G[11], ..., G[30] to G[35]).

The generated code segment from the SST for the cascade Gaussian filtering
application in the TDIF environment is shown as Program 3. In this code
segment, tdifc_ec_enable_check is the TDIF run-time API that implements the
enable method to test for sufficient input data for execution of a given actor, and
tdifc_ec_invoke is the TDIF run-time API that implements the invoke method
to execute a single invocation for that actor. Also, tdifc_lib_<A>_ec and
tdifc_lib_<A>_tc implement the execution context and the topological context,
respectively, which are instances of retargetable data structures that encapsulate
relevant state information of an actor <A>.

For details on the enable method and the invoke method, we refer the
reader to [19]. For details on the execution context and topological context,
we refer the reader to [24].

5.3  Evaluation  in  Terms  of  Coding  Efficiency 


Our design framework for specifying topological patterns enables concise and
scalable representation of multimedia applications. To help quantify this kind of
benefit, we apply an evaluation metric called lines of code (LOC), which is the
number of lines of code required for an application. Unless otherwise specified,
the LOC cost refers to code that the designer needs to provide manually (e.g., in
contrast to code that is automatically generated or reused from some other part of
an implementation). We apply this metric to various applications, including the
cascade Gaussian filtering application, that are specified with and without use of
topological patterns. Note that use of the LOC metric is facilitated by employing
lines that have reasonably consistent complexity; we have tried to follow such an
approach in our comparisons. A more accurate metric along these lines would be to
compare the numbers of lexical tokens. Exploration of such a more detailed metric
is an interesting direction for further study.

Program 2 Java code for building an SST for cascade Gaussian filtering using our
SST plug-in.

/* parameters */
int o = 6, s = 6;

/* cascade Gaussian filter ACN */
ScalableScheduleTree cgf = new ScalableScheduleTree();
cgf.addACN("CGF", 0);

/* filters ACN */
ScalableScheduleTree filters[] = new ScalableScheduleTree[o];

/* downsamplers: one between each pair of adjacent filter groups */
ScalableScheduleTree d[] = new ScalableScheduleTree[o-1];

for (int i = 0; i < o-1; i++) {
    filters[i] = new ScalableScheduleTree();
    filters[i].addACN("G", s);
    cgf.insertSchedule(filters[i]);
    d[i] = new ScalableScheduleTree();
    d[i].addSchedule("D");
    cgf.insertSchedule(d[i]);
}

filters[o-1] = new ScalableScheduleTree();
filters[o-1].addACN("G", s);
cgf.insertSchedule(filters[o-1]);

Program 3 A segment of code that is generated in the TDIF environment from
the SST for the cascade Gaussian filtering application.

if (tdifc_ec_enable_check(tdifc_lib_g0_ec, tdifc_lib_g0_tc)) {
    tdifc_ec_invoke(tdifc_lib_g0_ec, tdifc_lib_g0_tc);
}
/* ... */
if (tdifc_ec_enable_check(tdifc_lib_g5_ec, tdifc_lib_g5_tc)) {
    tdifc_ec_invoke(tdifc_lib_g5_ec, tdifc_lib_g5_tc);
}
if (tdifc_ec_enable_check(tdifc_lib_d0_ec, tdifc_lib_d0_tc)) {
    tdifc_ec_invoke(tdifc_lib_d0_ec, tdifc_lib_d0_tc);
}
if (tdifc_ec_enable_check(tdifc_lib_g6_ec, tdifc_lib_g6_tc)) {
    tdifc_ec_invoke(tdifc_lib_g6_ec, tdifc_lib_g6_tc);
}
/* ... */
if (tdifc_ec_enable_check(tdifc_lib_g11_ec, tdifc_lib_g11_tc)) {
    tdifc_ec_invoke(tdifc_lib_g11_ec, tdifc_lib_g11_tc);
}
/* ... */
if (tdifc_ec_enable_check(tdifc_lib_d4_ec, tdifc_lib_d4_tc)) {
    tdifc_ec_invoke(tdifc_lib_d4_ec, tdifc_lib_d4_tc);
}
if (tdifc_ec_enable_check(tdifc_lib_g30_ec, tdifc_lib_g30_tc)) {
    tdifc_ec_invoke(tdifc_lib_g30_ec, tdifc_lib_g30_tc);
}
/* ... */
if (tdifc_ec_enable_check(tdifc_lib_g35_ec, tdifc_lib_g35_tc)) {
    tdifc_ec_invoke(tdifc_lib_g35_ec, tdifc_lib_g35_tc);
}

5.3.1  LOC  Evaluation  for  Topological  Patterns 

We first compare LOC evaluation results by using TDL with and without the
support of topological patterns. Table 5.1 shows a comparison in terms of
LOC for TDL specifications with and without the support of topological patterns
for different applications. For the specifications in this comparison, each node and
edge declaration occupies a separate line of code. As an example, Program 4 and
Program 5 show the TDL specifications of the image registration application (see
Figure 5.1) without support for topological patterns and with support for topological
patterns, respectively.

Program 4 TDL code for the image registration application without support for
topological patterns.

topology {
    nodes = IR_r, IR_t, CGF_r, CGF_t, DOG_r, DOG_t,
            LED_r, LED_t, PP_r, PP_t, DA_r, DA_t,
            KPM, MR, TIT, IW;

    edges = e0(IR_r, CGF_r),
            e1(IR_t, CGF_t),
            e2(CGF_r, DOG_r),
            e3(CGF_t, DOG_t),
            e4(DOG_r, LED_r),
            e5(DOG_t, LED_t),
            e6(LED_r, PP_r),
            e7(LED_t, PP_t),
            e8(PP_r, DA_r),
            e9(PP_t, DA_t),
            e10(DA_r, KPM),
            e11(DA_t, KPM),
            e12(CGF_r, PP_r),
            e13(CGF_t, PP_t),
            e14(CGF_r, DA_r),
            e15(CGF_t, DA_t),
            e16(DOG_r, PP_r),
            e17(DOG_t, PP_t),
            e18(KPM, MR),
            e19(MR, TIT),
            e20(TIT, IW),
            e21(DA_r, MR),
            e22(DA_t, MR),
            e23(DA_r, TIT),
            e24(DA_t, TIT),
            e25(IR_t, TIT),
            e26(IR_t, IW);
}

Program 5 TDL code for the image registration application with support for
topological patterns.

topology {
    nodes = REF[6], TAR[6], REGIST[4];
    edges = e0[9] -> chain(REF[0:5], REGIST[0:3]),
            e1[6] -> chain(TAR[0:5], REGIST[0]),
            e2[2] -> broadcast(REF[1], REF[4:5]),
            e3[2] -> broadcast(TAR[1], TAR[4:5]),
            e4(REF[2], REF[4]),
            e5(TAR[2], TAR[4]),
            e6[2] -> broadcast(REF[5], REGIST[1:2]),
            e7[2] -> broadcast(TAR[5], REGIST[1:2]),
            e8[2] -> broadcast(TAR[0], REGIST[2:3]);
}


5.3.2  LOC  Evaluation  for  TDIF  Framework 

We  also  assess  the  LOC  benefit  for  the  cascade  Gaussian  filtering  application 
that  is  obtained  from  code  generation  in  the  TDIF  environment.  More  specifically, 
we  compare  the  LOC  cost  of  an  implementation  that  uses  code  generation  and  the 
LOC  cost  of  the  generated  code  (i.e.,  the  LOC  cost  of  the  generated  implementation). 
This  gives  a  comparison  of  the  complexity  of  the  complete  implementation  generated 
using  TDIF  compared  to  the  complexity  of  the  code  that  the  designer  has  to  write 
and  maintain  as  source  code. 

As discussed in Section 4, TDIF Version 0.2 contains a code generator to
translate SSTs into C code that implements the corresponding schedules. The
TDIF environment also provides tools to translate concise specifications of actor
interface information (input, output, state, etc.) into APIs for implementing the
actors according to standardized dataflow implementation structures in TDIF [24].
Additionally, the TDIF environment provides translation from DIF specifications
into top-level C language implementations that construct and execute the specified
dataflow graphs.

Table 5.1: LOC comparisons for TDL specifications with and without the support
of topological patterns (TPs).

Application                without TP    with TP
Cascade Gaussian filter        81            3
Image registration             43           12
JPEG encoder                   37            9
FFT (size N = 8)               32            2

Table 5.2 summarizes the LOC costs for different implementation components
for the cascade Gaussian filter application when code generation is used; i.e.,
these are the costs for the designer-written code that can be viewed as input to the
TDIF toolset. These costs are listed as functions of the numbers of dataflow graph
actors n and edges e in the scalable application, and the total LOC cost c of the
designer-written component of the actor implementations.

On the other hand, Table 5.3 shows the LOC costs of the complete generated
implementation; i.e., the generated code together with the designer-written TDIF
input code that is used directly (without translation) in the implementation.

Comparing  the  LOC  listings  in  the  two  tables,  we  see  that  as  the  number  of 
nodes  n  in  the  application  is  increased,  the  ratio  of  the  designer-written  LOC  cost 
to  the  complete  implementation  LOC  cost  decreases.  This  helps  to  quantify  the 
utility  of  the  TDIF  tool  in  terms  of  LOC  costs  as  a  function  of  graph  complexity. 
This  comparison  incorporates  the  use  of  topological  patterns,  which  help  to  reduce 
the  LOC  cost  for  the  top-level  DIF  specification. 
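
The trend can be checked numerically with a small C sketch that evaluates the
"Total" rows of Table 5.2 and Table 5.3. The function names designer_loc,
generated_loc, and loc_ratio are introduced here for illustration only, and the
sample values passed to them are assumptions rather than measurements.

```c
/* Total designer-written LOC from Table 5.2: 10n + e + 22 + c. */
long designer_loc(long n, long e, long c)
{
    return 10 * n + e + 22 + c;
}

/* Total LOC of the complete implementation from Table 5.3: 130n + 11 + c. */
long generated_loc(long n, long c)
{
    return 130 * n + 11 + c;
}

/* Ratio of designer-written LOC to complete-implementation LOC. */
double loc_ratio(long n, long e, long c)
{
    return (double)designer_loc(n, e, c) / (double)generated_loc(n, c);
}
```

For a chain-structured graph with e = n - 1 and c held fixed, the ratio
approaches 11/130 (roughly 0.085) as n grows, so under these assumptions the
designer-written share of the code outside the actor bodies levels off below
one tenth.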




Table 5.2: LOC costs for designer-written code in the TDIF environment.

Component                      LOC
Top-level DIF specification    5n+e+6
TDIF specification             5n
Building SST                   16
Actor development              c
Total                          10n+e+22+c

Table 5.3: LOC costs for the implementation generated by the TDIF environment.

Component                      LOC
Top-level C file               9n+6
Function declaration           56n
Scheduling APIs                22n
Scheduling file header         2n+5
Scheduling                     41n
Actor development              c
Total                          130n+11+c

5.4  Evaluation  in  Terms  of  Execution  Time 
5.4.1  TDP  Processing  Time 

As shown in Section 5.3, support for topological patterns notably reduces the
amount of input a designer needs to provide when using TDL to specify a system.
In this section, we evaluate the TDP processing time with and without support
for topological patterns. Here, by TDP processing time, we mean the execution
time of TDP in reading the TDL specification file and storing the dataflow graph
information within intermediate representations.

Table 5.4 shows our comparison results. The processing time is slightly shorter
(by about 3%) for TDP with support for topological patterns. The input TDL
specification specifies the dataflow graph of the image registration application
shown in Figure 5.1. The corresponding TDL code is shown in Program 4 and
Program 5. The results are obtained as the average execution time over 100 runs in
each of the two cases.

Table 5.4: Execution time for reading the TDL specification file and storing the
dataflow graph information within appropriate intermediate representations.

without support for TPs (sec)    with support for TPs (sec)
0.973                            0.943

Table 5.5: Image registration application execution time for the dataflow-based
implementation in the TDIF environment and a conventional implementation (without
dataflow-based modeling).

Implementation in TDIF (sec)     Plain implementation (sec)
30.523                           30.476

5.4.2  Application  Execution  Time 

In this section, we compare the image registration application execution time
for the dataflow-based implementation (see Figure 5.1) in the TDIF environment,
and the "conventional" implementation without dataflow-based modeling. Table 5.5
shows the comparison results. The input images are 1200 x 900 gray-scale bitmap
images. The results are obtained as the average execution time over 10
runs in each of the two cases. We see that the execution times of the two cases
differ by less than 0.2%, which means that the coding efficiency of our new
modeling approach does not come at significant performance cost when using this
application in our design framework.

5.5 Cross-Platform Experimentation
5.5.1  Cascade  Gaussian  Filtering 

TDIF includes capabilities for targeting CUDA-enabled graphics processing
units (GPUs) in addition to pure C code ("CPU targeted") implementations [24].
As part of this application case study, we experimented with the CUDA-targeted
synthesis capability of TDIF for the cascade Gaussian filter application. As our
experiments show, parts of the application are a good match for GPU execution,
and thus, the synthesized GPU implementation exhibits significant performance
improvement. This aspect of our case study validates the utility of topological
patterns and the developed tool chain in enhancing application specification and
scalability in the context of cross-platform experimentation to explore trade-offs on
alternative targets. Linkage to such experimentation capabilities is important for
multimedia-oriented tools since there is a wide variety of relevant platforms available
for multimedia system implementation.

In these experiments, input to the application is a 1200 × 900 gray-scale bitmap
image, and the implementations are executed on a 3 GHz PC with an Intel CPU
that is equipped with 4 GB RAM, and co-located with an NVIDIA GTX260 GPU.
Table 5.6 shows a performance comparison of CPU-targeted and GPU-targeted
implementations for the cascade Gaussian filtering application. Both implementations
were generated by TDIF based on SSTs that exploit topological pattern structures
in the application specifications. Each reported execution time is the average over
100 runs of the corresponding implementation.
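
The averaging procedure described above can be sketched as follows. This is only an illustration under stated assumptions, not the measurement harness used in these experiments: the function name `average_runtime`, its parameters, and the stand-in workload are hypothetical.

```python
import time
from statistics import mean


def average_runtime(fn, runs, clock=time.perf_counter):
    """Return the mean wall-clock time (in seconds) of `runs` calls to fn().

    `fn` is a hypothetical zero-argument callable that executes one complete
    run of the implementation being measured.
    """
    if runs <= 0:
        raise ValueError("runs must be positive")
    samples = []
    for _ in range(runs):
        start = clock()
        fn()
        samples.append(clock() - start)
    return mean(samples)


if __name__ == "__main__":
    # Stand-in workload; replace with a call into the implementation under test.
    avg = average_runtime(lambda: sum(range(100_000)), runs=100)
    print(f"average over 100 runs: {avg:.6f} s")
```

When timing a GPU-targeted implementation in this way, `fn` would need to block until the device has finished its work, since CUDA kernel launches are asynchronous with respect to the host; otherwise the measured interval would undercount the execution time.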

Table 5.6: Performance comparison for CPU-targeted and GPU-targeted cascade
Gaussian filtering implementations.

CPU (sec)    GPU (sec)    Speedup
11.79282     0.46281      25.48
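
As a check on the reported value, the speedup entry is the ratio of the two averaged execution times. The symbols S, T_CPU and T_GPU are introduced here only for illustration:

```latex
S = \frac{T_{\mathrm{CPU}}}{T_{\mathrm{GPU}}}
  = \frac{11.79282\ \mathrm{s}}{0.46281\ \mathrm{s}}
  \approx 25.48
```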

The results show that GPU acceleration provides significant benefit in this
application, and validate the retargetability of our use of topological patterns and
SSTs in TDIF. Use of the TDIF environment allows us to obtain such a comparison
with relatively high coding efficiency, and a correspondingly high degree of automation,
as demonstrated in Section 5.3. This is due to the high level of abstraction and
accompanying  formal  modeling  capabilities  provided  by  TDIF  and  the  associated 
TDL  programming  features.  Use  of  topological  patterns  helps  to  enhance  the  coding 
efficiency  and  raise  the  level  of  abstraction  further  by  representing  applications  in 
terms  of  scalable,  higher  level  constructs  that  are  complementary  to  conventional 
forms  of  hierarchy,  which  are  employed  in  related  kinds  of  dataflow  specifications. 

5.5.2  Image  Registration  Results 

In this section, we show experimental results for the whole image registration
application and provide a performance comparison for CPU-targeted and GPU-targeted
implementations. In these experiments, the cascade Gaussian filter is no
longer modeled as a system that contains many actors. It is just one actor in the
overall image registration application illustrated in Figure 5.1. We demonstrate image
registration results with two examples. Figure 5.8, Figure 5.9 and Figure 5.10
show the reference image, target image and result image for the first example,
respectively. Similarly, Figure 5.11, Figure 5.12 and Figure 5.13 show the reference
image, target image and result image for the second example, respectively.

Figure 5.8: Reference image for Example 1.

Table 5.7 shows a performance comparison between CPU-targeted and GPU-targeted
implementations for the GPU-targetable actors and the overall image registration
application. Inputs to the application are again 1200 × 900 gray-scale bitmap
images, and the implementations are executed on a 3 GHz PC with an Intel CPU
that is equipped with 4 GB RAM, and co-located with an NVIDIA GTX260 GPU.
Each reported execution time is the average over 10 runs of the corresponding
implementation.

Figure 5.9: Target image for Example 1.

Figure 5.10: Result image for Example 1.

Figure 5.11: Reference image for Example 2.

Figure 5.12: Target image for Example 2.

Figure 5.13: Result image for Example 2.


Table 5.7: Performance comparison between CPU-targeted and GPU-targeted
implementations for the GPU-targetable actors and the overall image registration
application.

                              CPU (sec)   GPU (sec)   Speedup
Cascade Gaussian filter       11.896      0.416       28.60
Difference of Gaussian        0.584       0.012       48.67
Target image transformation   0.614       0.017       36.12
Whole application             55.575      30.523      1.82
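
The speedup column of Table 5.7 can be recomputed from the two time columns. The sketch below is only an illustrative check: the names `TABLE_5_7`, `speedup` and `summarize` are hypothetical, and the times are copied from the table rather than measured.

```python
# Averaged execution times in seconds, copied from Table 5.7 as (CPU, GPU).
TABLE_5_7 = {
    "Cascade Gaussian filter": (11.896, 0.416),
    "Difference of Gaussian": (0.584, 0.012),
    "Target image transformation": (0.614, 0.017),
    "Whole application": (55.575, 30.523),
}


def speedup(cpu_seconds, gpu_seconds):
    """Speedup of the GPU-targeted run relative to the CPU-targeted run."""
    if gpu_seconds <= 0:
        raise ValueError("GPU time must be positive")
    return cpu_seconds / gpu_seconds


def summarize(table):
    """Return {row name: (speedup, seconds saved)} for each row of the table."""
    return {
        name: (round(speedup(cpu, gpu), 2), round(cpu - gpu, 3))
        for name, (cpu, gpu) in table.items()
    }


if __name__ == "__main__":
    for name, (ratio, saved) in summarize(TABLE_5_7).items():
        print(f"{name:<28} speedup {ratio:6.2f}x  saved {saved:7.3f} s")
```

The recomputed ratios agree with the table to two decimal places. The table does not report how often each actor fires during a run, so this sketch does not attempt to attribute the whole-application saving to individual actors.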

Chapter 6

Conclusions  and  Future  Work 

6.1  Conclusions 

In  this  thesis,  we  have  presented  a  novel  schedule  model  called  the  scalable 
schedule  tree  (SST)  for  representing  parameterized  schedule  structures  based  on 
topological  patterns.  We  have  also  presented  language  extensions  for  specifying 
topological  patterns  and  a  new  plug-in  to  the  dataflow  interchange  format  (DIF) 
framework for specifying SSTs that execute dataflow models with topological patterns,
and for generating C code that implements the parameterized schedules represented
by these SSTs. Through a case study centered around an image registration
application,  we  have  validated  our  new  methods  and  tools,  and  demonstrated  their 
utility  in  the  design  and  implementation  of  multimedia  systems. 

6.2  Future  Work 

Useful  directions  for  further  work  include  the  following: 

•  developing  techniques  for  automated  derivation  of  SSTs; 

•  exploring  SSTs  that  incorporate  more  complex  forms  of  adaptivity; 

•  supporting  code  generation  on  additional  classes  of  platforms,  such  as  field 
programmable  gate  arrays  and  multicore  digital  signal  processors; 

•  incorporating  into  TDL  the  SST  plug-in  that  we  have  developed  in  this  work; 

•  extending  TDL  with  additional  topological  patterns;  and 

• further developing non-rigid image registration applications based on the
scale-invariant feature transform (SIFT) algorithm.